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Abstract 

Background: Models that are capable of reliably predicting binding affinities for protein-ligand complexes play an 
important role the field of structure-guided drug design. 

Methods: Here, we begin by applying the computational geometry technique of Delaunay tessellation to each set 
of atomic coordinates for over 1400 diverse macromolecular structures, for the purpose of deriving a four-body 
statistical potential that serves as a topological scoring function. Next, we identify a second, independent set of 
three hundred protein-ligand complexes, having both high-resolution structures and known dissociation constants. 
Two-thirds of these complexes are randomly selected to train a predictive model of binding affinity as follows: two 
tessellations are generated in each case, one for the entire complex and another strictly for the isolated protein 
without its bound ligand, and a topological score is computed for each tessellation with the four-body potential. 
Predicted protein-ligand binding affinity is then based on an empirically derived linear function of the difference 
between both topological scores, one that appropriately scales the value of this difference. 

Results: A comparison between experimental and calculated binding affinity values over the two hundred 
complexes reveals a Pearson's correlation coefficient of r = 0.79 with a standard error of SE = 1.98 kcal/mol. To 
validate the method, we similarly generated two tessellations for each of the remaining protein-ligand complexes, 
computed their topological scores and the difference between the two scores for each complex, and applied the 
previously derived linear transformation of this topological score difference to predict binding affinities. For these 
one hundred complexes, we again observe a correlation of r = 0.79 {SE = 1.93 kcal/mol) between known and 
calculated binding affinities. Applying our model to an independent test set of high-resolution structures for three 
hundred diverse enzyme-inhibitor complexes, each with an experimentally known inhibition constant, also yields a 
correlation of r = 0.79 {SE = 2.39 kcal/mol) between experimental and calculated binding energies. 

Conclusions: Lastly, we generate predictions with our model on a diverse test set of one hundred protein-ligand 
complexes previously used to benchmark 15 related methods, and our correlation of r = 0.66 between the 
calculated and experimental binding energies for this dataset exceeds those of the other approaches. Compared 
with these related prediction methods, our approach stands out based on salient features that include the 
reliability of our model, combined with the rapidity of the generated predictions, which are less than one second 
for an average sized complex. 



Correspondence: mmasso@gmu.edu 

Laboratory for Structural Bioinformatics, School of Systems Biology, George 
Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia 20110, 
USA 

© 201 3 Masso; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 
License (http://creativecommons.Org/Iicenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, 
provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ 
publicdomain/zero/l.O/) applies to the data made available in this article, unless othen/^ise stated. 



(3 Bion/led Central 



Masso BMC Structural Biology 2013, 13(Suppl 1):S1 
http://www.biomedcentral.com/1472-6807/13/S1/S1 



Page 2 of 9 



Background 

Experimental high-throughput screening processes that 
drive structure-guided drug design efforts are effective 
tools for the identification of candidate molecular ligands 
that may tightly bind a target protein; however, such an 
approach often proves to be a costly endeavor, in terms 
of both time and financial expense, one that can poten- 
tially be alleviated with reliable in silico protein-ligand 
binding affinity models to assist in winnowing the search 
space [1]. A diverse array of computational approaches to 
model binding affinity have been described in the litera- 
ture, each of which focuses on a unique combination of 
physicochemical properties and interactions: X-Score [2], 



Lig-Score [3], DrugScore [4], SFCscore [5], AutoDock4 
[6], ITScore [7,8], and PHOENIX [9] are just a few exam- 
ples of such predictive tools. Here we describe our devel- 
opment of a model for predicting protein-ligand binding 
energy that relies on Delaunay tessellation, a computa- 
tional geometry technique [10], for the purpose of objec- 
tively capturing nearest neighbor atomic four-body 
interactions in the structures of macromolecular com- 
plexes (Figure 1). 

First, we compute the propensities for occurrence of all 
atomic quadruplet interactions by applying the tessella- 
tion procedure to atomic coordinates for a diverse cross- 
section of over 1400 high-resolution macromolecular 




Figure 1 Atomic Delaunay tessellation of the hen egg-white lysozyme (a) in complex with bound ligand NAG (N-Acetyl-D- 
Glucosamine) and (b) without the bound ligand (PDB accession code: IHEW). 
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crystal structures, and the data collected are used in gen- 
erating an atomic four-body potential. Tasked with dis- 
tinguishing native structures as having global energy 
minima relative to decoys, our knowledge-based potential 
performs well compared to several related atomic energy 
functions [11,12]; however this work constitutes substan- 
tial research outside the immediate focus of this study, 
and accordingly it will be reported elsewhere. Next, we 
apply our atomic potential to a separate dataset of three 
hundred diverse protein-ligand complexes, each selected 
for having both a solved high-resolution crystal structure 
and a known dissociation constant {kd), the latter quan- 
tity being useful for determining the Gibbs free energy of 
binding (AG). Two thirds of the complexes are randomly 
selected to train our predictive model of binding affinity: 
in each case, the entire complex is tessellated and then 
scored using the four-body potential, as is the structure 
of the isolated protein without its bound ligand, and we 
derive an empirical linear function of the difference 
between these scores to predict AG values. The remain- 
ing one hundred complexes are then used to validate the 
capability of the trained linear model to predict binding 
energies for new protein-ligand complexes. 

The steps taken to develop our model formed the basis 
of a recently published companion study [13], and here 
we begin by carefully outlining those details below, since 
they lay the foundation for the next stage of the work to 
be presented. In particular, the model is subsequently 
applied to the prediction of binding affinities for an 
independent, diverse test set of three hundred enzyme- 
inhibitor complexes for which high-resolution crystal 
structures, as well as experimentally determined inhibition 
constants (/c,), are available. Also, model performance 
is comprehensively benchmarked against a number of 
related methods from the literature. 

Methods 

Datasets 

High-resolution (< 2.2A) crystallographic structures for 
1417 macromolecular complexes (Additional file 1), culled 
using the PISCES server [14] and having protein chains 
that share low (< 30%) sequence identity, were selected to 
develop the four-body statistical potential. Dataset diver- 
sity is also reflected in the fact that the complexes consist 
of both single chain and multimeric proteins, many of 
which have bound ligands in the form of either small 
molecules or peptides. Each complex has a coordinate file 
deposited in the Protein Data Bank (PDB) [15], and fol- 
lowing the removal of all hydrogen atoms and water mole- 
cules, Delaunay tessellation is applied to each structure file 
by using all the remaining atomic coordinates. 

In order to train and validate our model for predicting 
binding affinity, we selected another diverse set of three 
hundred protein-ligand complexes (Additional file 2) 



from the Binding MOAD [16,17] database. The Binding 
MOAD is a repository for all protein-ligand complexes 
that have high-resolution (< 2.5A) structures deposited 
in the PDB, and if available, published experimental 
binding energy data. Focusing specifically on a non- 
redundant subset of the Binding MOAD, both to ensure 
diversity of complexes as well as to minimize bias due to 
over-represented structures, we identified three hundred 
complexes having both PDB coordinate files as well as 
experimental dissociation constants {ka). The PDB acces- 
sion codes and k^ values for the protein-ligand complexes 
are tabulated in Additional file 2, as is the identity of the 
subset (200 for training, and 100 for validation) into 
which each is randomly placed. 

Software and performance measurements 

We use the Qhull software package [18] to carry out the 
atomic Delaunay tessellations, Matlab (Version 7.0.1.24704 
(R14) Service Pack 1) to produce graphical depictions of 
the tessellations, and the UCSF Chimera software package 
[19] to generate all other molecular visualizations in this 
study. Codes to perform all data formatting and analyses 
tasks are written in the Perl programming language. 

Given the dissociation constant (kj) for a protein-ligand 
complex, the standard Gibbs free energy of binding (AG, 
in units of kcal/mol) can be determined using 



AG = RT\n{kd) = 0.592 x \n{kd), 



(1) 



where R = 1.986 xlO'^ kcal IC^ mol'^ is the gas constant 
and T = 298° K is the absolute temperature. We evaluate 
the agreement between known (xi) and predicted (j/) 
binding energies by reporting the Pearson's correlation 
coefficient 



the standard error of the predictions 



(2) 



(3) 



and the equation of the fitted regression line. 



Results 

Four-body statistical potential 

To generate our knowledge-based potential, a six-letter 
alphabet (C, N, O, S, M = all metals, X = all other non- 
metals) is used for labelling all atoms (excluding hydro- 
gens and water molecules). The Qhull software uses the 
3-dimensional (3D) coordinates of atoms in a PDB file 
to generate a Delaunay tessellation of the structure, a 
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space-filling convex hull formed by hundreds of solid, 
non-overlapping, irregular tetrahedra whose vertices are 
the 3D atomic points. Each atom serves as a vertex, with 
most being shared by numerous adjacent tetrahedra, and 
every tetrahedral simplex objectively identifies a quadru- 
plet of nearest neighbor atoms at its four vertices. To 
ensure this is indeed the case, we eliminate all edges 
longer than SA immediately upon tessellation, which is in 
agreement with related research in this arena at the atomic 
[20] and residue [21,22] levels of coarse-graining. The 
combined total number of tetrahedra remaining for analy- 
sis after tessellating the 1417 PDB coordinate files is pro- 
vided in Table 1, as are the total number of atoms of each 
type as well as their relative frequencies. 

Without regards to the ordering of a quadruplet of 
atoms (i.e, all permutations of the four letters are non- 
unique and represent the same quadruplet), and allowing 
for the repeated occurrence of atom types in any given 
quadruplet (i.e., letters may appear more than once in a 
quadruplet), there are 126 possible types of atomic quad- 
ruplets that can be enumerated based on the use of a 
6-letter atomic alphabet (Table 2). For each quadruplet 
(i,i,k,I), we define as the observed proportion of all tet- 
rahedral simplices obtained by tessellating all 1417 struc- 
tures to have those four types of atoms at the vertices; 
similarly, we let piji^i represent the rate expected by chance, 
which is based on relative frequencies of the six atom 
types in the structures (Table 1) and calculated using a 
multinomial background distribution given by 

41 6 <= 6 

Pm = -g— — n 'Jn -where ^ fl„ = 1 and ^ t„ = 4. 
n (t„!) "=i "=i 

In Eq. (4), a„ is the relative frequency of atom type n, 
while ^„ counts how many times atom type n appears in 
the quadruplet (i,j,k,l). As a consequence of the inverted 
Boltzmann principle [23], the score Si^^i = log (^vt; / Pi^ki) is 
proportional to the energy of quadruplet atomic interac- 
tion, and the set of 126 scored atomic quadruplets defines 
our four-body statistical potential (Table 2). 



Table 1 Summary data for the 1417 PDB structure files. 



Atom Types 


Count 


Proportion 


C 


3,612,988 


0.633193 


N 


969,253 


0.169866 


0 


1,088,410 


0.190749 


S 


28,502 


0.004995 


(all metals) M 


2,529 


0.000443 


(all other non-metals) X 


4,299 


0.000754 


Total atom count: 


5,705,981 




Total tetrahedron count: 


34,504,737 





Topological scores 

In order to develop our predictive model, the four-body 
potential is applied to the dataset of three hundred pro- 
tein-ligand complexes compiled from the Binding MOAD 
in the following manner. For each complex, the atomic 
coordinates (excluding hydrogens and water molecules) in 
the PDB file are tessellated (edges longer than sA 
removed), each tetrahedron in the tessellation is scored 
using Table 2 according to the four atoms at its vertices, 
and a normalized topological score (Q) is calculated to be 
the sum of all the tetrahedral scores divided by the num- 
ber of tetrahedra in the tessellation, a quantity that can be 
summarized compactly by the equation 

Q = i E (5) 

Next, atomic coordinates of the ligand are removed 
from the PDB file of the complex, and the procedure is 
repeated to compute Q for the isolated protein (Figure 1). 
Lastly, we define the topological score difference 

= Qcomplex Qprotein (6) 

for the complex. In the next section, we compare com- 
puted AQ quantities with known AG values for these 
complexes in order to develop a model for predicting 
binding energy. An important underlying assumption in 
this formulation is that ligand size is small enough so 
that tetrahedra formed at the interface with the protein 
dominate purely internal atomic interactions within the 
ligand. The calculated Q values, for structures of the 
three hundred protein-ligand complexes, as well as the 
isolated proteins without their bound ligands, are tabu- 
lated in Additional file 2. 

Predictive model of binding energy 

A comparison of the calculated AQ quantities for our 
training set of two hundred randomly selected complexes 
with their experimental AG values (AGg^p) reveals a corre- 
lation coefficient of r = 0.79. However, the AQ values are 
not uniform in sign, and they are on a significantly smaller 
scale relative to the standard Gibbs free energy of binding 
(AGexp) data; hence, they cannot be used directly as a 
representation of predicted AG values (AGcaic)- Both issues 
related to AQ values for the training data are addressed 
with an empirically derived linear function 

AGcaic = (1/0.0003) X AQ - 10.49, (7) 

resulting in negative AGcaic values in each case that also 
scale similarly to AGexp- Owing to AGcaic arising from a 
simple linear transformation of the AQ values, AGcaic and 
AGexp also display a correlation of r = 0.79 (SE = 1.98 kcal/ 
mol) with a fitted regression line of j = 1.18a; (Figure 2). 
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Table 2 Atomic four-body statistical potential. 



Quad 


Count 




Pijkl 




Quad 


Count 


fjjld 


Pijkl 




cccc 


4015872 


0.116386 


0.160748 


-0.140244 


MMNS 


363 


1.05 E-05 


2.00E-09 


3.720958 


CCCM 


1592 


4.61 E-05 


0.000450 


-0.989223 


MMNX 


0 


0 


3.02E-10 




CCCN 


4025206 


0.1 16657 


0.1 72495 


-0.169866 


MMOO 


306 


8.87E-06 


4.29E-08 


2.315530 


ccco 


6202159 


0.179748 


0.193701 


-0.032467 


MMOS 


104 


3.01 E-06 


2.25E-09 


3.127729 


cccs 


293157 


0.008496 


0.005072 


0.224008 


MMOX 


3 


8.69E-08 


3.39E-10 


2.409325 


cccx 


2796 


8.10 E-05 


0.000765 


-0.975047 


MMSS 


254 


7.36E-06 


2.94E-1 1 


5.398477 


CCMM 


132 


3.83E-06 


4.73E-07 


0.908235 


MMSX 


2 


5.80E-08 


8.87E-1 2 


3.815151 


CCMN 


3318 


9.62E-05 


0.000362 


-0.575981 


MMXX 


0 


0 


6.69E-1 3 




CCMO 


5325 


0.0001 54 


0.000407 


-0.420893 


MNNN 


1030 


2.99E-05 


8.69E-06 


0.535960 


CCMS 


2293 


6.65E-05 


1 .07E-05 


0.795108 


MNNO 


1128 


3.27E-05 


2.93E-05 


0.047955 


CCMX 


15 


4.35E-07 


1.61 E-06 


-0.567697 


MNNS 


561 


1 .63E-05 


7.67E-07 


1.326526 


CCNN 


1797552 


0.052096 


0.069412 


-0.124635 


MNNX 


5 


1 .45E-07 


1.16E-07 


0.098041 


CCNO 


8233136 


0.238609 


0.155892 


0.184864 


MNOO 


3744 


0.000109 


3.29E-05 


0.518626 


CCNS 


124653 


0.003613 


0.004082 


-0.053081 


MNOS 


314 


9.10E-06 


1 .72E-06 


0.723107 


CCNX 


2007 


5.82E-05 


0.000616 


-1 .024729 


MNOX 


29 


8.40E-07 


2.60E-07 


0.510083 


ccoo 


3366568 


0.097568 


0.087528 


0.047161 


MNSS 


793 


2.30E-05 


2.25E-08 


3.008398 


ccos 


1 98630 


0.005757 


O.OO'158'I 


0.098905 


MNSX 


5 


1 .45E-07 


6.80E-09 


1.328573 


ccox 


4626 


0.0001 34 


0.000691 


-0.712426 


MNXX 


9 


2.61 E-07 


5.13E-10 


2.706383 


cess 


15288 


0.000443 


6.00E-05 


0.8681 58 


MOOO 


5430 


0.000157 


1 .23 E-05 


1.106856 


ccsx 


144 


4.17E-06 


1.81 E-05 


-0.637352 


MOOS 


156 


4.52E-06 


9.67E-07 


0.669977 


ccxx 


143 


4.14E-06 


1 .37E-06 


0.482159 


MOOX 


168 


4.87E-06 


1 .46E-07 


1 .523669 


CMMM 


23 


6.67E-07 


2.21 E-10 


3.480397 


MOSS 


210 


6.09E-06 


2.53E-08 


2.380989 


CMMN 


144 


4.17E-06 


2.54E-07 


1.216422 


MOSX 


4 


1.16E-07 


7.64E-09 


1.181307 


CMMO 


256 


7.42 E-06 


2.85E-07 


1.415945 


MOXX 


55 


1 .59E-06 


5.76E-10 


3.442148 


CMMS 


662 


1 .92E-05 


7.46E-09 


3.410480 


MSSS 


62 


1 .80E-06 


2.21 E-10 


3.910199 


CMMX 


1 


2.90E-08 


1.12E-09 


1.41 1 130 


MSSX 


2 


5.80E-08 


1 .OOE-1 0 


2.763224 


CMNN 


2474 


7.17E-05 


9.72E-05 


-0.132029 


MSXX 


0 


0 


1.51 E-11 




CMNO 


6267 


0.0001 82 


0.000218 


-0.079754 


MXXX 


16 


4.64E-07 


7.58E-1 3 


5.786451 


CMNS 


2588 


7.50E-05 


5.72E-06 


1.1 18068 


NNNN 


3878 


0.0001 1 2 


0.000833 


-0.869698 


CMNX 


26 


7.54E-07 


8.62E-07 


-0.05841 5 


NNNO 


46665 


0.001352 


0.003740 


-0.441 730 


CMOO 


8481 


0.000246 


0.000123 


0.302308 


NNNS 


460 


1.33E-05 


9.79E-05 


-0.866046 


CMOS 


1010 


2.93 E-05 


6.42E-06 


0.659069 


NNNX 


34 


9.85E-07 


1 .48E-05 


-1.175817 


CMOX 


68 


1.97E-06 


9.68E-07 


0.308765 


NNOO 


340620 


0.009872 


0.006299 


0.195102 


CMSS 


2047 


5.93E-05 


8.40E-08 


2.84881 3 


NNOS 


5637 


0.000163 


0.000330 


-0.305233 


CMSX 


13 


3.77E-07 


2.53E-08 


1.1 721 17 


NNOX 


302 


8.75E-06 


4.98E-05 


-0.754766 


CMXX 


5 


1 .74E-07 


1.91 E-09 


1 .958862 


NNSS 


31 1 


9.01 E-06 


4.32E-06 


0.319427 


CNNN 


1 02035 


0.002957 


0.012414 


-0.623046 


NNSX 


6 


1 .74E-07 


1 .30E-06 


-0.874705 


CNNO 


1995038 


0.057819 


0.041821 


0.140679 


NNXX 


5 


1.45 E-07 


9.83E-08 


0.168652 


CNNS 


15892 


0.000461 


0.001095 


-0.376176 


NOOO 


171 147 


0.004960 


0.004716 


0.021937 


CNNX 


578 


1 .68E-05 


0.000165 


-0.993919 


NOGS 


10697 


0.000310 


0.000370 


-0.077374 


CNOO 


2734639 


0.079254 


0.046962 


0.227273 


NOOX 


3102 


8995-05 


5.59E-05 


0.206513 


CMOS 


95438 


0.002766 


0.002460 


0.050981 


NOSS 


922 


2.67E-05 


9.70E-06 


0.440012 


CNOX 


2168 


6.28E-05 


0.000371 


-0.771 1 73 


NOSX 


12 


3.48E-07 


2.93E-06 


-0.925060 


CNSS 


4264 


0.0001 24 


3.22E-05 


0.584024 


NOXX 


61 


1 .77E-06 


2.21 E-07 


0.903627 




D7 
J/ 


1 .U/t-UD 


y./ 1 t-Uo 


n of^7i 1 ^ 
-u.yj / 1 1 J 




jj 


Q t^AP n7 


p /1 7P nQ 




CNXX 


61 


1.77E-06 


7.33E-07 


0.382553 


NSSX 


0 


0 


3.83E-08 




COOO 


524994 


0.015215 


0.017579 


-0.062707 


NSXX 


0 


0 


5.78E-09 




COOS 


34429 


0.000998 


0.001381 


-0.141141 


NXXX 


3 


8.69E-08 


2.91 E-10 


2475964 


COOX 


23801 


0.000690 


0.000208 


0.520038 


0000 


34212 


0.000992 


0.001324 


-0.125549 


COSS 


4380 


0.000127 


3.62E-05 


0.545326 


OOOS 


4240 


0.0001 23 


0.000139 


-0.052504 


cosx 


58 


1 .68E-06 


1.09E-05 


-0.812243 


OOOX 


9553 


0.000277 


2.09E-05 


1.121777 


coxx 


65 


1.88E-06 


8.23E-07 


0.359781 


OOSS 


300 


8.69E-06 


5.45E-06 


0.203077 


csss 


285 


8.26E-06 


3.16E-07 


1 41 7735 


OOSX 


36 


1.04E-06 


1 .64E-06 


-0.197264 
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Table 2 Atomic four-body statistical potential. (Continued) 



cssx 


5 


1.45E-07 


1 .43 E-07 


0.00624/ 


ooxx 


128 


3.71 E-06 


1 .24E-07 


1 .4761 81 


csxx 


4 


1.16E-07 


2.15E-0B 


0.730845 


osss 


38 


1.10E-06 


9.51 E-08 


1 .063748 


cxxx 


9 


2.61 E-07 


1 .08E-09 


2.381656 


ossx 


3 


8.69E-08 


4.30E-08 


0.305472 


MMMM 


83 


2.41 E-06 


3.85E-14 


7.794725 


osxx 


0 


0 


649E-09 




MMMN 


37 


1.07E-06 


5.92E-1 1 


4.258301 


oxxx 


2 


5.80E-08 


3.26E-10 


2.249518 


MMMO 


29 


8.40E-07 


6.64E-1 1 


4.102142 


ssss 


6 


1 .74E-07 


6.23E-10 


2.446092 


MMMS 


379 


1.10E-05 


1 .74E-1 2 


6.800300 


sssx 


0 


0 


3.76E-10 




MMMX 


0 


0 


2.52E-13 




ssxx 


0 


0 


8.50E-1 1 




MMNN 


83 


241 E-06 


3.40E-08 


1 .849597 


sxxx 


0 


0 


8.55E-12 




MMNO 


102 


2.96E-06 


7.64E-08 


1.587734 


xxxx 


0 


0 


3.22E-13 





Turning next to the validation set of one hundred com- 
plexes, we obtain AGcaic values from their computed AQ 
quantities by utilizing the linear model given in Eq. (7) 
that we empirically derived from the training data. The 
predicted AGcaic and knovm AGexp values for these com- 
plexes again display a correlation of r = 0.79 {SE = 1.93 
kcal/mol) with a fitted regression line oiy = - 0.63, 
and a scatter plot of the validation data is superimposed 
over that of the training data in Figure 2. Tabulated in 
Additional file 2 are AGgxp and AGcaic values for all three 
hundred protein-ligand complexes. 

Discussion 

Enzyme-inhibitor binding affinity prediction 

In order to test the utility of our model through a practi- 
cal application, we predict binding affinities for a diverse 
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Figure 2 Scatter plot of calculated versus experimental binding 
energies for the dataset of three hundred proteln-llgand 
complexes. 



dataset of three hundred enzyme-inhibitor complexes 
(Additional file 3), independent of those protein-ligand 
complexes used for training and validation, which are 
annotated with their respective experimental inhibition 
constants (/c,) in the non-redundant Binding MOAD. 
Analogous to Eq. (1), we obtain the standard Gibbs free 
energy of binding for each complex with the equation 



AGexp = RTln(fei) = 0.592 x In(fei), 



(8) 



where R = 1.986 xlO"^ kcal K'^ mol'^ is the gas constant 
and T = 298° K is the absolute temperature. Tabulated in 
Additional file 3 are the PDB accession codes of the high- 
resolution (< 2.5A) crystallographic structures, as well as 
the ki and AGexp values, corresponding to these enzyme- 
inhibitor complexes. 

Next, we use the atomic coordinates (hydrogen atoms 
and water molecules excluded) provided by the PDB struc- 
ture file for each complex to generate a Delaunay tesse- 
llation (subject to an SA edge-length cutoff), from which 
we obtain a normalized topological score (Qcompiex) 
by employing Eq. (5) in conjunction with our atomic four- 
body statistical potential (Table 2). In a similar fashion, we 
generate a normalized topological score for the isolated 
protein without the bound inhibitor (Qprotein)> by tessellat- 
ing a modified version of the PDB file that excludes the 
atomic coordinates for the inhibitor. Lastly, we calculate 
the difference (AQ) between these normalized topological 
scores according to Eq. (6), which is subsequently used 
by our model in Eq. (7) to yield a prediction for the 
enzyme-inhibitor binding affinity (AGcaic)- AH normalized 
topological score and calculated binding affinity data are 
also tabulated in Additional file 3. 

For this dataset of three hundred enzyme-inhibitor com- 
plexes, the calculated AQ values and the experimental 
binding affinity AGexp data display a correlation of r = 
0.79; likewise, as discussed previously, the correlation 
between AGcaic and AGexp is similarly given by r = 0.79, in 
this case with a calculated standard error for the predic- 
tions of SE = 2.39 kcal/mol (Figure 3). 
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Figure 3 Scatter plot of calculated versus experimental binding 
energies for the dataset of three hundred enzyme-inhibitor 
complexes. 



Comparisons to related methods 

In the same way that our predictive model of protein- 
Ugand binding affinity is evaluated on a test set of three 
hundred enzyme-inhibitor complexes as described in the 
previous section, other related methods similarly use test 
sets of complexes to validate their models. Hence, to 
directly compare our performance to that of other meth- 
ods, binding affinity predictions are generated using our 
approach for complexes that form their test sets. Starting 
with X-Score, Wang et al. [2] report predictions with their 
model on a test set of ten complexes that reflect a correla- 
tion of r = 0.67 between experimental and predicted bind- 
ing affinity (right hand columns of Table 3 in [2], predicted 
data are in parentheses), with a fitted regression line oiy = 
Q.'ilx + 3.78. On the identical dataset, predictions obtained 
with our model yield a correlation of r = 0.72 and fitted 
regression line oi y = l.26x - 1.20, results that signify a 
clear improvement over those of X-Score (Table 3 of this 
manuscript, which also reproduces the X-Score data). 

Turning next to ITScore, we discover that Huang et al. 
[8] utilize a benchmarking test set consisting of one hun- 
dred protein-ligand complexes, originally constructed by 
Wang et al. [24], to compare their scoring function and 14 
other methods by ranking the respective Pearson's correla- 
tion coefficients (r) between experimental and predicted 
binding affinities. The test set is diverse, consisting of 43 
different proteins as well as binding affinities that span 
nearly nine orders of magnitude. By generating binding 



Table 3 Comparing experimental binding affinity values 
for 1 0 proteln-llgand complexes with predicted values 
obtained using both X-Score and the model developed In 
this study. 



PDB code 


Exp. 


X-Score 


Exp. 


Our model 


1ABE 


6.52 


5.25 


-8.887610045 


-10.3129 


1ADB 


8.40 


8.01 


-11.45029515 


-23.3388 


1ADD 


6.74 


5.36 


-9.187498728 


-13.1243 


1AF2 


3.10 


4.90 


-4.225704163 


-10.0826 


1ANF 


546 


6.03 


-7.442691848 


-9.59131 


1CBX 


6.35 


5.74 


-8.655877882 


-11.6284 


1DBM 


9.44 


6.65 


-12.86795074 


-14.9936 


1DHF 


7.40 


5.27 


-10.08716478 


-13.1753 


1GST 


468 


5.21 


-6.379450155 


-5.48917 


1HPV 


9.22 


6.28 


-12.56806206 


-16.0903 




X-Score: r = 0.67 


Our model: 


! = 0.72 



Experimental AG data for the complexes are derived from the experimental 
p/cd values. 



affinity predictions for these one hundred complexes with 
our model and calculating their correlation with the experi- 
mental data, we can subsequently determine our ranking 
among these 15 related approaches: ITScore [8], X-Score 
[2], DFIRE [25], DrugScore'^^^ [26], DrugScore™'* [4], 
Cerius2/PLP [27,28], SYBYL/G-Score [29], SYBYL/D-Score 
[30], SYBYL/ChemScore [31], Cerius2/PMF [32], DOCK/ 
FF [30], Cerius2/LUDI [33,34], Cerius2/Lig-Score [35], 
SYBYL/F-Score [36], and AutoDock [37]. The results of 
our predictions are summarized in Figure 4, which provides 
a scatter plot of calculated versus experimental binding 
energies for this dataset of one hundred complexes. The 
plot reflects a correlation of r = 0.66 (r = 0.67 with one out- 
lier complex excluded), surpassing all of the other methods 
(Table 4, data for the other methods are obtained from 
Table 3 in [8]) and validating the reliability of our approach. 

Conclusions 

Delaunay tessellation of atomic coordinates in a diverse 
dataset of macro molecular structures objectively identi- 
fies four-body atomic interactions, providing the raw 
data for developing a knowledge-based atomic four-body 
statistical contact potential. This potential is used to 
score a separate diverse set of three hundred protein- 
ligand complexes with known binding affinities, as well 
as to score the isolated proteins without their bound 
ligands, based on their respective structure tessellations. 
Initially, the difference (AQ) between scores calculated 
for an entire complex and for its isolated protein is con- 
sidered as a predictor of binding affinity; however, since 
these AQ do not scale as binding free energy values, two 
hundred randomly selected protein-ligand complexes 
from this set are used to empirically derive a linear 
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Figure 4 Scatter plot of calculated versus experimental binding 
energies using our model for the benchmarking dataset of one 
hundred enzyme-inhibitor complexes. 



model is then used to predict binding energies for an 
independent and diverse test set of three hundred 
enzyme-inhibitor complexes, producing results that are 
consistent with those based on the training and valida- 
tion data. Finally, we utilize a diverse test set of one hun- 
dred protein-ligand complexes to benchmark the binding 
energy predictions made with our model, and our corre- 
lation between calculated and experimental binding ener- 
gies for this dataset surpasses those of all 15 related 
methods to which it is compared. A key advantage with 
our approach is the ability to generate rapid predictions, 
typically under one second per complex. 

Additional material 



Additional file 1: PDB accession codes for the 1417 macromolecular 
structures used to derive the atomic four-body statistical potential, httpy/ 

proteins.gmu.edu/automute/Additional file l.txt 

Additional file 2: Three hundred protein-ligand complexes used to train 
and validate the model. httpy/proteins.gmu.edu/automute/Additional file 

2.txt 

Additional file 3: Independent set of three hundred enzyme-inhibitor 
complexes used to test the model, http://proteins.gmu.edu/automute/ 
Additional file 3.txt 



function of AQ as a model for calculating the binding 
energy. For this training set, we observe a correlation of r 
= 0.79 between calculated and experimental binding 
energies, with a standard error of SE = 1.98 kcal/mol and 
a regression line of y = 1.18x. Validation of this model 
with the remaining one hundred complexes that were 
held out yields performance measures of r = 0.79 and SE 
= 1.93 kcal/mol. In an application of the method, our 



Table 4 Benchmarking correlations between calculated 
and experimental binding energies using our model and 
15 related methods on 100 protein-ligand complexes. 



Method 


Type 


Correlation coefficient (r) 


Our model 


Knowledge-based 


0.66 


ITScore 


Iterative score 


0,65 


X-Score 


Fmpirica 


0.64 


DFIRE 


Knowledge-based 


0.63 


DrugScore°° 


Knowledge-based 


0.62 


DrugScore''™ 


Knowledge-based 


0.60 


Cerius2/PLP 


Ennpirical 


0.56 


SYBYL/G-Score 


Force-field-based 


0.56 


SYBYL/D-Score 


Force-field-based 


048 


SYBYiyChemScore 


Ennpirical 


047 


Cerius2/PMF 


Knowledge-based 


0.40 


DOCK/FF 


Force field 


040 


Cerius2/LUDI 


Empirical 


0.36 


Cerius2/Lig-Score 


Force-field-based 


0.35 


SYBYL/F-Score 


Fmpirical 


0.30 


AutoDock 


Force-field-based 


0.05 
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